A keyphrase of a document is any expression from which we can learn its content without having to
read it. Keyphrases are exploited in many natural language processing (NLP) applications. These
phrases are often mentioned in the document, but some keyphrases may be absent from it. In the field
of automatic keyphrase extraction (AKE), researchers have exploited many techniques, such as
statistical calculation, deep learning algorithms, graph representation, and sentence embedding.
Approaches based on embedding techniques calculate the similarity between a document and each
candidate keyphrase, and the phrases most similar to the document are considered keyphrases.
Representing the document by a single vector degrades the performance of these approaches,
especially on long documents, and they are also unable to generate absent keyphrases. To overcome
these problems, this paper proposes an unsupervised approach to AKE based on the universal sentence
encoder (USE), which is used to represent the candidate keyphrases and the parts of the document
likely to contain keyphrases. Our method also generates keyphrases not mentioned in the text. We
compared the performance of the proposed approach with other methods based on embedding techniques;
the results showed the superiority of our approach, especially on long documents.

1. INTRODUCTION

The massive growth in the production of digital documents has driven the search for solutions that
can summarize or analyze their content without having to read them [1]. Automatic keyphrase
extraction (AKE) remains one of the most effective ways to solve this problem. According to
Papagiannopoulou and Tsoumakas [2], four techniques have been exploited in unsupervised AKE methods,
namely statistics-based, graph-based ranking, language model-based, and sentence embedding
techniques. Sentence embedding [3] is one of the more recent techniques: researchers use it to
represent the document and the candidate phrases, measure the semantic similarity between them, and
consider the phrases closest to the document as keyphrases. The performance of most AKE methods
using embedding techniques remains low, especially on long documents. Such documents contain
information that is not relevant to their topics, so the vector representing the document does not
reflect its content, which makes the semantic similarity between the document and the candidate
keyphrases inaccurate. Keyphrases often express the document's title, abstract, or conclusions, so
this factor must be taken into account when measuring the similarity between the document and the
candidate keyphrases.

The objective of this paper is to propose an unsupervised AKE method based on sentence embedding
that takes into account the location of the keyphrases in the document when measuring the similarity
of the candidate keyphrases to the content of the document. According to Ajallouda et al. [4], the
best technique for representing noun phrases is the universal sentence encoder (USE) [5]. Since
candidate keyphrases are noun phrases, we chose this technique to represent the candidate phrases
and the parts of the document that may include keyphrases. To calculate the semantic similarity more
precisely, we used weighting coefficients for these parts, taking into account the proximity of the
candidate phrases to them. The proposed approach not only extracts keyphrases contained in the
document, but also generates keyphrases that are not mentioned in it.

We have organized the paper as follows. Section 2 presents related works. Section 3 is dedicated to 
presenting the proposed method for extracting and generating keyphrases from a document. Section 4 
includes the results and discussion of the performance of the proposed approach in extracting the present 
keyphrases and generating the absent keyphrases. Section 5 includes conclusions and future directions for 
research. 


2. RELATED WORKS 

This section presents the work most relevant to the proposed method. We first introduce the most
common sentence embedding techniques and mention the advantages of each. We then introduce the most
important AKE methods, especially those based on embedding techniques. Finally, we present the
methods that generate absent keyphrases.


2.1. Sentence embedding techniques 

Text data is more complex than other types of data, which makes it difficult for machines to handle.
Sentence embedding is a common way to convert texts of different lengths into vectors of the same
dimension. Most studies of text embedding distinguish two families of methods. The first averages
the vectors of the words that make up the sentence, as in smooth inverse frequency (SIF) [6] and
geometric embedding (GEM) [7], and takes the resulting vector as the representation of the sentence.
Although this approach is simple, it neglects word order, so part of the sentence's meaning is lost
and the embedding is less semantically accurate. The second family, made possible by the advent of
neural encoders and deep learning algorithms, overcomes this problem. InferSent [8] is a recurrent
neural network (RNN) based embedding technique that predicts semantic relationships between
sentences. The universal sentence encoder [5] embeds text with a transformer-based model; it can
also use a deep average network (DAN), which gives better results on short texts. Sentence
bidirectional encoder representations from transformers (SBERT) [9] employs BERT [10] together with
a Siamese network for sentence embedding. In general, although these techniques give excellent
results compared to traditional methods, their disadvantage is that they are computationally
expensive and require great effort for training. In some documents, such as biomedical documents
[11]-[13], keyphrases rarely appear in the text, which is reflected in the performance of methods
using embedding techniques. Therefore, for biomedical documents it is necessary to generate
keyphrases instead of extracting them.


2.2. Keyphrases extraction approaches 

Several AKE methods have been published in recent years. According to Siddiqi and Sharan [14], there
are supervised methods, which treat keyphrase extraction as a binary classification problem, and
unsupervised methods, which extract keyphrases based on their ranking. There are also
semi-supervised methods, which combine the two previous approaches. Some studies have categorized
these methods by the type of technique used to extract keyphrases. According to Papagiannopoulou and
Tsoumakas [2], some approaches exploit binary classification algorithms, such as [15], [16]; most of
these methods are supervised. Statistical models have also been exploited in some methods, such as
[17]-[19], most of which are unsupervised. The best-known AKE methods are those that rely on graph
techniques, such as [20]-[22]. Among recent methods, those that use deep learning algorithms, such
as [23]-[25], have achieved the best performance. The development of sentence embedding techniques
has also contributed to the emergence of AKE methods that use them, such as [26]-[28].


2.3. Keyphrase generation approaches 

In some documents, some keyphrases may not be mentioned in the text. It is therefore not enough to
extract only the keyphrases that are mentioned, and some AKE methods also generate keyphrases that
do not appear in the document. In Crawshaw [29], the authors propose to generate keyphrases using an
encoder that predicts the semantic meaning of a phrase through a recurrent neural network (RNN).
This method created a few problems, the biggest of which was generating phrases with the same
meaning. To overcome this problem, the authors of [30] suggested using correlational recurrent
neural networks (CorrRNN) to generate keyphrases covering the topics of the document. The downside
of this method is the huge amount of data required for training. To reduce this amount, the authors
of [31] suggested a topic-based adversarial neural network (TANN) that uses both labeled and
unlabeled data. Pang and Lee [32] suggested exploiting the preceding and following context of a
sentence to predict keyphrases, using a bidirectional long short-term memory (Bi-LSTM) RNN. Rabby et
al. [33] proposed a model based on a seq2seq RNN that extracts the existing keyphrases and predicts
those not mentioned in the document by exploiting linguistic, semantic, and statistical information.
Nguyen and Kan [34] suggested a method that uses the keyphrases mentioned in the text to construct
keyphrases not mentioned in the text using the mask-predict method.


3. KEYPHRASE EXTRACTION AND GENERATION METHOD 
3.1. Process of the proposed method 

The method proposed in this article consists of six main steps, as shown in Figure 1. In the first
step, we extract the candidate keyphrases using the approach proposed in [35]. The second step
embeds the candidate keyphrases with the transformer model, while the paragraphs of the document are
embedded with the deep average network (DAN) model. The third step identifies the parts of the
document that may contain keyphrases. The fourth step measures the similarity between the candidate
keyphrases and the paragraphs of the document and extracts the keyphrases mentioned in the text. The
fifth step generates keyphrases not mentioned in the document. The sixth and final step deletes
phrases with the same semantic meaning.

Figure 1. Present and absent keyphrases extraction process


3.2. Candidate keyphrases 

One of the challenges faced by AKE methods is identifying candidate keyphrases in a document. Many
techniques have been used for this, such as term frequency-inverse document frequency (TF-IDF),
N-grams, and part-of-speech tagging (POST). The method proposed in [35] gives acceptable results.
Figure 2 presents the steps it follows to select candidate keyphrases. Our method adopts the same
process for identifying candidate keyphrases, which allowed us to discard many unimportant phrases
in the document.


3.3. Sentence and paragraph embedding 

The keyphrase extraction method that we propose in this paper is based on calculating the similarity
between the candidate keyphrases and the paragraphs selected in the third step. Calculating the
semantic similarity between two texts requires a vector representation of each. Several sentence
embedding techniques can be used to represent paragraphs and candidate keyphrases. According to
Ajallouda et al. [4], noun phrases are best represented using the universal sentence encoder (USE).
We chose this technique because the candidate keyphrases are noun phrases.

The universal sentence encoder is a recent technique that represents a sentence or a paragraph by a
vector of 512 dimensions. USE first encodes the phrases into vectors, which are then used in several
NLP tasks. The errors made on these tasks are exploited to improve the vector representation of the
phrases. Figure 3 shows the USE process.

Figure 2. Candidate keyphrase selection process

Figure 3. Universal sentence encoder process


Encoding in USE can be done by a six-layer transformer, in which each layer contains a
self-attention module and a feed-forward network, enabling the transformer to exploit the context
and the word order during the embedding process. Encoding can also be accomplished via the deep
average network (DAN) [36]: the vectors of the words that make up the phrase are averaged, and the
resulting vector passes through a four-layer deep neural network (DNN) that produces a vector with
512 dimensions. Figure 4 presents the two encoders exploited by USE.

Figure 4. Transformer encoder and deep average network process
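
The averaging-then-feed-forward idea behind DAN can be illustrated with a small sketch. This is not the USE implementation: the layer weights are random and untrained, the tanh activation and the toy dimensions (4-dimensional word vectors, 8-dimensional output instead of 512) are assumptions made only to keep the example readable, and the function names are hypothetical.

```python
import math
import random


def average_vectors(word_vectors):
    """Element-wise mean of equally sized word vectors."""
    n = len(word_vectors)
    return [sum(column) / n for column in zip(*word_vectors)]


def dense(x, weights, bias):
    """One fully connected layer followed by tanh (activation is an assumption)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, bias)]


def dan_encode(word_vectors, layers):
    """Average the word vectors, then apply each feed-forward layer in turn."""
    hidden = average_vectors(word_vectors)
    for weights, bias in layers:
        hidden = dense(hidden, weights, bias)
    return hidden


def random_layer(n_in, n_out, rng):
    """Hypothetical untrained layer, used only to illustrate the shapes."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


rng = random.Random(0)
# Four layers, as in the description above; toy sizes instead of 512 dimensions.
layers = [random_layer(4, 8, rng)] + [random_layer(8, 8, rng) for _ in range(3)]
phrase = [[0.1, 0.2, 0.3, 0.4], [0.4, 0.3, 0.2, 0.1]]  # two toy word vectors
embedding = dan_encode(phrase, layers)
print(len(embedding))
```

Because the word vectors are averaged before any layer is applied, reversing the word order gives the same embedding, which is why DAN is cheap but blind to word order.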


Empirical results for USE on six datasets (see Table 1) confirmed that transformer encoding performs
better than DAN encoding. On the other hand, it is difficult to use the transformer to encode long
texts because it requires more time than DAN.

Table 1. USE performance with the transformer model and the DAN model

Dataset                                         Transformer model   DAN model   Task
Customer reviews [31]                           87.43               80.97       Product reviews
Movie reviews [37]                              81.44               74.45       Sentiment
Multi-perspective question and answering [38]   86.98               85.38       Opinion polarity
Stanford sentiment analysis [39]                85.38               77.62       Sentiment
Subjectivity summarization [32]                 93.87               92.65       Subjectivity/Objectivity
Text retrieval conference [32]                  92.51               91.19       Question and answering


The USE authors likewise point out that the complexity of the transformer model is O(n^2) in the
length n of the text, while that of the DAN model is O(n). For this reason, our method uses the
transformer-based model to encode the candidate keyphrases, while the paragraphs are encoded with
the DAN-based model.


3.4. Score paragraph 

The purpose of this step is to identify the paragraphs that express the content of the document and to 
get rid of irrelevant paragraphs. To identify the paragraphs expressing the document, we started from the 
assumption that these paragraphs are semantically similar to most of the paragraphs of the document, in 
particular title, abstract and conclusion. For this, we first calculate the score of each paragraph using (1). 


$$\mathrm{Score}(P_i) = \frac{1}{N}\left[\,2 \times \big(\mathrm{Sim}(V_i, T) + \mathrm{Sim}(V_i, A) + \mathrm{Sim}(V_i, C)\big) + \sum_{j=1}^{N} \mathrm{Sim}(V_i, V_j)\right] \tag{1}$$


N: number of paragraphs in the document
$V_i$: vector representing paragraph i
$V_j$: vector representing paragraph j
T: vector representing the title
A: vector representing the abstract
C: vector representing the conclusion
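
As a minimal sketch of how the score in (1) could be computed, the following pure-Python code uses cosine similarity for Sim. The function names are hypothetical, the vectors are assumed to be non-zero, and the leading factor is taken as 1/N, which is an assumption; any positive constant would leave the later threshold selection and weighted ranking unchanged.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def paragraph_scores(paragraphs, title, abstract, conclusion):
    """Score each paragraph vector against title, abstract, conclusion and all paragraphs."""
    n = len(paragraphs)
    scores = []
    for v_i in paragraphs:
        # Title, abstract and conclusion are weighted twice as much as ordinary paragraphs.
        key_parts = cosine(v_i, title) + cosine(v_i, abstract) + cosine(v_i, conclusion)
        all_paragraphs = sum(cosine(v_i, v_j) for v_j in paragraphs)
        scores.append((2 * key_parts + all_paragraphs) / n)  # 1/N factor is an assumption
    return scores


# Toy 2-dimensional embeddings: the first paragraph is aligned with the title,
# abstract and conclusion, the second is orthogonal to them.
toy_scores = paragraph_scores(
    paragraphs=[[1.0, 0.0], [0.0, 1.0]],
    title=[1.0, 0.0],
    abstract=[1.0, 0.0],
    conclusion=[1.0, 0.0],
)
print(toy_scores)
```

In this toy case the on-topic paragraph receives the higher score, as intended by the weighting.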

Recently, deep learning methods that calculate semantic similarity between texts have appeared 
[40]. The disadvantage of these methods is that their complexity is high, as well as the need to provide data 
for training. For this, we have chosen cosine as a measure of semantic similarity between two texts. Cosine 
measure is calculated using (2). 


$$\mathrm{Sim}(U, V) = \frac{U \cdot V}{\|U\| \, \|V\|} \tag{2}$$

$$S = \frac{\mathrm{Max}(\mathrm{Score}) + \mathrm{Min}(\mathrm{Score})}{2} \tag{3}$$


We have defined the threshold for selecting paragraphs similar to the document in (3). Each 
paragraph with a score greater than or equal to S, will be considered similar to the document. This formula 
gave us better results than using all the paragraphs or adopting the scores average as a selection threshold. 
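
The threshold in (3) is simply the midpoint between the highest and lowest paragraph scores. A small sketch, with a hypothetical function name and made-up scores, shows the selection:

```python
def select_paragraphs(scores):
    """Return the threshold S and the indices of paragraphs whose score is >= S."""
    threshold = (max(scores) + min(scores)) / 2
    selected = [i for i, score in enumerate(scores) if score >= threshold]
    return threshold, selected


# Made-up scores for four paragraphs.
threshold, selected = select_paragraphs([3.5, 0.5, 2.0, 1.9])
print(threshold, selected)
```

With these values the midpoint is 2.0, so paragraphs 0 and 2 are kept and the other two are discarded.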


3.5. Keyphrase extraction 

In the previous step, the paragraphs expressing the document were identified. The candidate
keyphrases that are most semantically similar to these paragraphs are therefore taken as the
keyphrases. To calculate the proximity of each candidate keyphrase to the paragraphs representing
the document, we used (4).


$$\mathrm{SKP}(CP_i) = \frac{\sum_{j=1}^{M} \mathrm{Score}(P_j) \times \mathrm{Sim}(U_i, V_j)}{\sum_{k=1}^{M} \mathrm{Score}(P_k)} \tag{4}$$


$CP_i$: candidate keyphrase i.
$\mathrm{Score}(P_j)$: score of paragraph $P_j$, calculated via (1).
$\mathrm{Sim}(U_i, V_j)$: similarity between $U_i$, the vector of candidate keyphrase $CP_i$, and $V_j$, the vector of paragraph $P_j$.
M: the number of paragraphs expressing the document.

Candidate keyphrases will be ranked according to results of (4). The phrases with the highest score 
will be considered present keyphrases. We experimentally determined the number of appropriate keyphrases. 
These experiments will be presented in the results section. The selected keyphrases will be used in the absent 
keyphrase prediction process. 
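
The ranking by (4) amounts to a score-weighted average of the similarities between a candidate and the selected paragraphs. The sketch below illustrates this under stated assumptions: the function names, the toy vectors, and the paragraph scores are all hypothetical, and cosine similarity stands in for Sim.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def skp(candidate_vec, paragraph_vecs, paragraph_scores):
    """Weighted average of candidate-paragraph similarities, weighted by paragraph score."""
    weighted = sum(score * cosine(candidate_vec, vec)
                   for score, vec in zip(paragraph_scores, paragraph_vecs))
    return weighted / sum(paragraph_scores)


def rank_candidates(candidates, paragraph_vecs, paragraph_scores, top_n):
    """Return the top_n candidate phrases ordered by decreasing SKP."""
    ranked = sorted(candidates,
                    key=lambda phrase: skp(candidates[phrase], paragraph_vecs, paragraph_scores),
                    reverse=True)
    return ranked[:top_n]


paragraph_vecs = [[1.0, 0.0], [0.0, 1.0]]   # toy paragraph embeddings
paragraph_scores = [3.0, 1.0]               # toy paragraph scores
candidates = {"phrase A": [1.0, 0.0], "phrase B": [0.0, 1.0]}
print(rank_candidates(candidates, paragraph_vecs, paragraph_scores, top_n=2))
```

Here "phrase A" is aligned with the higher-scoring paragraph, so it obtains the larger SKP value (0.75 against 0.25) and ranks first.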


3.6. Keyphrase generation

When reading and analyzing a document, we often find that there are phrases that clearly express its
content even though they are not mentioned in it. These phrases are called absent keyphrases.
Analysis of the datasets used in training and evaluating AKE methods confirms this: a large share of
the keyphrases of a document, in some datasets more than half, are absent keyphrases. Table 2 shows
the percentage of absent keyphrases in the most popular datasets.


Table 2. The percentage of present and absent keyphrases in some datasets

Dataset    Type        Docs      Present KP (%)   Absent KP (%)
SemEval    Papers      244       42.60            57.40
NUS        Papers      211       54.40            45.60
Krapivin   Papers      2 304     55.70            44.30
KPTimes    News        259 923   58.80            41.20
KP20K      Abstracts   527 090   62.90            37.10
Inspec     Abstracts   1 000     73.60            26.40


This fact led us to extend our approach so that it can predict absent keyphrases. For this we used
the RNN encoder-decoder model of [41], [42]. This model (also called Seq2Seq) is used by deep
learning methods that predict keyphrases [23], [43], [44]. However, some of the absent keyphrases
these methods generate do not express the content of the document. Our method mitigates this problem
by replacing the source document in the model input with the main paragraphs identified in the third
step, together with the previously extracted keyphrases. This eliminates irrelevant text and reduces
the amount of model training data.


3.7. Keyphrases selection 

After the keyphrase extraction and generation process is completed, our method filters all the
resulting phrases with Algorithm 1 to avoid selecting duplicate keyphrases, phrases that are part of
other phrases, or irrelevant phrases. For example, the phrase "computer science" is more expressive
than "computer" or "science" alone. The algorithm also removes irrelevant phrases that do not
express the document.


Algorithm 1. Keyphrases selection
Input: EKP, list of extracted keyphrases
       GKP, list of generated keyphrases
Output: KP, list of selected keyphrases
Begin
  // Remove duplicate keyphrases and keyphrases that are part of other keyphrases
  for i=0 to len(GKP)-1 do
    for j=0 to len(EKP)-1 do
      if (duplicate(GKP[i], EKP[j]))
        remove(GKP[i])
        i=i-1
        break
      else
        if (part(GKP[i], EKP[j])) // GKP[i] is part of EKP[j] or vice versa
          if (score(GKP[i]) < score(EKP[j]))
            remove(GKP[i])
            i=i-1
            break
          else
            remove(EKP[j])
            j=j-1
          end if
        end if
      end if
    end for
  end for

  // Remove irrelevant keyphrases
  EKP.append(GKP)
  KP ← []
  for i=0 to len(EKP)-1 do
    if (score(EKP[i]) >= (scoreMax+scoreMin)/2)
      KP.append(EKP[i])
    end if
  end for
  return KP
end

This algorithm allowed us to select the list of present and absent keyphrases that are most
expressive of the document. Our method then ranks them according to their scores calculated using
(4). The phrases that rank first are the keyphrases of the document.
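
One possible Python rendering of Algorithm 1 is sketched below. The paper does not define the duplicate and part tests, so the two predicates here (case-insensitive token equality, and contiguous token containment) are assumptions, as is the choice to take scoreMax and scoreMin over the merged list; all names are hypothetical.

```python
def tokens(phrase):
    """Lower-cased whitespace tokens of a phrase."""
    return phrase.lower().split()


def is_duplicate(a, b):
    """Assumed duplicate test: same tokens, ignoring case."""
    return tokens(a) == tokens(b)


def is_part(a, b):
    """Assumed part test: the shorter phrase occurs contiguously inside the longer one."""
    short, long_ = sorted((tokens(a), tokens(b)), key=len)
    k = len(short)
    return any(long_[i:i + k] == short for i in range(len(long_) - k + 1))


def select_keyphrases(extracted, generated, score):
    """Filter generated against extracted keyphrases, then drop low-scoring phrases."""
    ekp = list(extracted)
    gkp = []
    for g in generated:
        keep = True
        for e in list(ekp):  # iterate over a copy because ekp may shrink
            if is_duplicate(g, e):
                keep = False
                break
            if is_part(g, e):
                if score[g] < score[e]:
                    keep = False
                    break
                ekp.remove(e)  # the generated phrase wins, drop the extracted one
        if keep:
            gkp.append(g)
    merged = ekp + gkp
    if not merged:
        return []
    values = [score[p] for p in merged]
    threshold = (max(values) + min(values)) / 2  # midpoint, as in the pseudocode
    return [p for p in merged if score[p] >= threshold]


score = {"computer": 0.4, "neural network": 0.8, "computer science": 0.7,
         "Neural Network": 0.8, "deep learning": 0.2}
result = select_keyphrases(["computer", "neural network"],
                           ["computer science", "Neural Network", "deep learning"],
                           score)
print(result)
```

In this toy run "computer" is replaced by the higher-scoring "computer science", the duplicate "Neural Network" is dropped, and "deep learning" falls below the midpoint threshold. Building new lists instead of deleting while indexing avoids the index adjustments needed in the pseudocode.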


4. RESULTS AND DISCUSSION 

In the first part of this section, we present the tools used to evaluate our method. We first
describe the datasets used in training and testing, as well as the evaluation metrics. We then
describe how the paragraphs that express the document are selected. In the second part, we discuss
the results obtained and compare them with the performance of other AKE methods.


4.1. Datasets 

Several publicly available datasets can be used to evaluate AKE methods. However, training a
recurrent neural network requires a dataset with a huge number of documents. The KP20k [23] dataset
meets this need: it contains 527,830 articles, of which 40,000 are randomly selected, with 20,000
devoted to training and the rest used for testing. That is why we chose this dataset to train the
component that generates absent keyphrases.

In addition to KP20k, we evaluate our method on the Inspec [45] dataset, which contains abstracts of
scientific papers and enables us to measure performance on short texts. We also use the SemEval-2010
[46] and NUS [34] datasets, which contain full scientific papers, to evaluate the performance of our
method on long texts.


4.2. Evaluation metrics 

Several metrics are available to evaluate the performance of AKE methods [47]. However, most
researchers prefer to use only three measures: recall (5), the proportion of the document's
reference keyphrases that are extracted; precision (6), the proportion of extracted keyphrases that
are valid; and F1-measure (7), the harmonic mean that combines precision and recall.


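$$\mathrm{Recall} = \frac{\text{True keyphrases}}{\text{Keyphrases assigned by the author}} \tag{5}$$

$$\mathrm{Precision} = \frac{\text{True keyphrases}}{\text{True keyphrases} + \text{False keyphrases}} \tag{6}$$

$$\mathrm{F1\text{-}measure} = \frac{2 \times \mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \tag{7}$$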


Their ease of use and the precision of their results are what made most researchers prefer to
evaluate AKE methods using these metrics. We will use them to evaluate our method and to compare its
performance with that of other methods.
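
The three metrics, evaluated on the top k predicted phrases (as in F1@5 or F1@10), can be sketched as follows. Exact string matching between predicted and reference keyphrases is an assumption made for simplicity, and the function name is hypothetical.

```python
def precision_recall_f1(predicted, gold, k):
    """Precision, recall and F1 computed on the first k predicted keyphrases."""
    top_k = predicted[:k]
    true_positives = len(set(top_k) & set(gold))
    precision = true_positives / len(top_k) if top_k else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: 2 of the 5 predictions appear among the 4 reference keyphrases.
precision, recall, f1 = precision_recall_f1(
    predicted=["a", "b", "c", "d", "e"], gold=["a", "c", "x", "y"], k=5)
print(precision, recall, f1)
```

Here precision is 2/5, recall is 2/4, and the F1-measure is their harmonic mean, 4/9.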


4.3. Paragraphs of the document 

Before the present keyphrases are extracted, the paragraphs that express the content of the document
are selected by calculating the score of each paragraph with (1). The paragraphs with the highest
scores are chosen as the paragraphs expressing the document. To select the appropriate number of
paragraphs, we tried three cases: the first keeps all the paragraphs, the second uses the threshold
in (3), and the third chooses the paragraphs whose score is greater than the average score of all
paragraphs, calculated by (8).


$$\mathrm{AvgScore} = \frac{1}{N} \sum_{i=1}^{N} \mathrm{Score}(P_i) \tag{8}$$


N: number of paragraphs in the document
$\mathrm{Score}(P_i)$: the score of paragraph i, calculated via (1).
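
To see how the three cases can differ, the following sketch applies them to a made-up, skewed list of paragraph scores. The function names and the scores are hypothetical and only illustrate the selection rules, not the reported results.

```python
def select_all(scores):
    """Case 1: keep every paragraph."""
    return list(range(len(scores)))


def select_midpoint(scores):
    """Case 2: keep paragraphs scoring at least the midpoint of max and min."""
    threshold = (max(scores) + min(scores)) / 2
    return [i for i, s in enumerate(scores) if s >= threshold]


def select_above_average(scores):
    """Case 3: keep paragraphs scoring strictly above the average score."""
    average = sum(scores) / len(scores)
    return [i for i, s in enumerate(scores) if s > average]


scores = [1.0, 0.4, 0.1, 0.1, 0.1]  # one dominant paragraph, several weak ones
print(select_all(scores), select_midpoint(scores), select_above_average(scores))
```

With these scores the midpoint is 0.55 and the average is 0.34, so the midpoint rule keeps only paragraph 0, while the average rule also keeps paragraph 1. When many low-scoring paragraphs pull the average down, the midpoint rule is therefore the stricter filter.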


We applied these three cases to all datasets. The results obtained (see Figure 5) showed that the
second case outperforms the others in all datasets, so the appropriate number of closest paragraphs
of the document is determined using (3). The same number of paragraphs is used in the process of
generating the absent keyphrases.


Figure 5. F-measure for the different cases of selecting the paragraphs that express the document
(panels: F1@5 and F1@10 for the NUS dataset; F1@5 and F1@10 for the KP20k dataset)


4.4. Evaluation results 

Our method underwent two stages of evaluation. The first was limited to evaluating the performance 
of present keyphrases extraction and comparing it with the performance of methods that use the embedding 
technique to extract keyphrases. The second stage involved evaluating the performance of absent keyphrases 
generation and comparing it with the methods for generating absent keyphrases. 


4.4.1. Present keyphrases extraction 

To evaluate the performance of our method for present keyphrase extraction, we compared it with
three methods that also adopt embedding techniques for keyphrase extraction. The first is EmbedRank
[48], an unsupervised method for extracting present keyphrases. It embeds the candidate keyphrases
and the document using the Sent2vec embedding technique [49]. The keyphrases are the candidates with
the greatest cosine similarity to the document, selected using maximal marginal relevance (MMR) to
avoid extracting the same keyphrase repeatedly. The second is MDERank [28], an unsupervised method
that uses BERT [50] to embed the document and its variants. The principle of MDERank is to create
variants of the original document in which some phrases are masked, and to calculate the semantic
similarity between each variant and the original document. The masked phrases whose variants achieve
the lowest semantic similarity to the original document are the most important to it. The third is
KP-USE [51], an unsupervised method that divides the document into five main parts. These parts and
the candidate phrases are embedded with USE, and the phrases that are semantically similar to these
parts are the keyphrases of the document.
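
For readers unfamiliar with maximal marginal relevance, the selection used by EmbedRank can be approximated by the generic sketch below. It is a textbook form of MMR, not EmbedRank's exact variant (which also normalizes the similarities); the function names, the diversity parameter, and the toy vectors are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def mmr_select(doc_vec, candidates, top_n, diversity=0.5):
    """Greedily pick phrases that are close to the document but far from earlier picks."""
    selected = []
    remaining = dict(candidates)
    while remaining and len(selected) < top_n:
        def mmr_score(phrase):
            relevance = cosine(remaining[phrase], doc_vec)
            redundancy = max((cosine(remaining[phrase], candidates[s]) for s in selected),
                             default=0.0)
            return (1 - diversity) * relevance - diversity * redundancy
        best = max(remaining, key=mmr_score)
        selected.append(best)
        del remaining[best]
    return selected


doc_vec = [1.0, 1.0]
candidates = {"a": [1.0, 0.9], "b": [1.0, 0.8], "c": [0.0, 1.0]}  # "b" nearly duplicates "a"
print(mmr_select(doc_vec, candidates, top_n=2, diversity=0.5))
print(mmr_select(doc_vec, candidates, top_n=2, diversity=0.0))
```

With diversity set to 0.5 the near-duplicate "b" is skipped in favor of "c", whereas with diversity 0 the selection falls back to pure similarity ranking and returns "a" and "b".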

Table 3 compares the performance of our method (KPEG) with that of these methods on the datasets
that we used. Each method extracted 5 keyphrases in the first phase and then 10 keyphrases in the
second phase. Our method excels on the datasets containing long texts (NUS and SemEval). On the
datasets containing short texts (KP20k and Inspec), its performance is comparable to that of the
other methods.


Table 3. KPEG performance for extracting present keyphrases compared with 3 methods

Method      NUS            KP20k          Inspec         SemEval
            F1@5   F1@10   F1@5   F1@10   F1@5   F1@10   F1@5   F1@10
EmbedRank   0.13   0.17    0.28   0.31    0.31   0.34    0.16   0.21
MDERank     0.15   0.18    0.32   0.34    0.28   0.32    0.17   0.20
KP-USE      0.07   0.08    0.25   0.28    0.22   0.26    0.15   0.20
KPEG        0.33   0.28    0.30   0.33    0.30   0.35    0.26   0.35


4.4.2. Absent keyphrases generation 

To measure how well our method generates absent keyphrases, we compared its performance with that of
two methods that generate absent keyphrases. The first is CopyRNN, proposed in [23], which generates
absent keyphrases using a model based on an RNN encoder-decoder enhanced by a copying mechanism [52]
that enables the RNN to generate appropriate phrases from the source text. The second is TG-Net
[53], which considers that the title of the document plays an essential role in generating absent
keyphrases: TG-Net uses the title as an additional query in the input to the encoder-decoder,
alongside the document text, which allows the model to exploit the information included in the title
when creating absent keyphrases. Generating absent keyphrases is a very complex task, and most
methods are evaluated on their performance when generating ten or more keyphrases. We therefore
evaluate our method on generating 10 keyphrases in the first stage and 50 keyphrases in the second
stage.

Table 4 presents the recall, as defined in (5), of our method for the generation of absent
keyphrases, compared with the CopyRNN and TG-Net methods. Our method obtains an average recall of
approximately 10%, which is a fairly acceptable result, especially given that present keyphrases can
account for over 60% of all keyphrases.


Table 4. KPEG performance for generating absent keyphrases compared with 2 methods

Method    NUS           KP20k         Inspec        SemEval
          R@10   R@50   R@10   R@50   R@10   R@50   R@10   R@50
CopyRNN   0.06   0.12   0.13   0.21   0.05   0.10   0.04   0.07
TG-Net    0.08   0.12   0.16   0.27   0.06   0.12   0.05   0.08
KPEG      0.07   0.13   0.12   0.18   0.06   0.09   0.07   0.10


4.5. Discussion 

The unsupervised approach proposed in this paper combines the extraction of keyphrases present in
the document with the generation of absent keyphrases. To extract the present keyphrases, we
exploited the USE embedding technique to select the paragraphs expressing the document, and we
considered the phrases similar to these paragraphs to be present keyphrases. To generate absent
keyphrases, we used the RNN encoder-decoder model. Instead of the source text, we gave this model
the paragraphs expressing the document as input, in order to avoid generating phrases that are
irrelevant to the document and to reduce the complexity of the model. The list of generated absent
keyphrases is filtered to avoid duplicate keyphrases. The keyphrases, whether present or absent, are
ranked according to the score of their proximity to the paragraphs, and the first N phrases are
selected as the keyphrases of the document.

The results of our evaluation showed that our method performed well in extracting the present
keyphrases compared to methods that use embedding techniques, especially on long documents. The
evaluation, in which we exploited four datasets, also showed that our method is able to generate
absent keyphrases, with an average recall of approximately 10%. This result is encouraging, and we
will improve it in the future by providing a larger dataset for training, especially for short
texts. Most methods that generate absent keyphrases find it difficult to overcome the problem of
phrase overlap and duplication. A number of solutions have been proposed to overcome this problem:
[24] proposed an automatic review mechanism to reduce duplicates, and Zhao and Zhang [54] apply
constraints to limit the generation of overlapping phrases. Our method addresses this problem with
Algorithm 1, which removes overlapping or duplicate phrases after the keyphrases, whether present or
absent, have been extracted and generated.


5. CONCLUSION

This paper presents an unsupervised method that combines the extraction of present keyphrases with
the generation of absent keyphrases. We exploited the USE embedding technique to extract the
keyphrases from the expressive paragraphs of the document, while the absent keyphrases were
generated by an RNN encoder-decoder that takes the expressive paragraphs of the document as input.
We evaluated our method on four datasets: NUS and SemEval, which contain long documents, and Inspec
and KP20k, which contain short texts. The results showed the superiority of our method in extracting
the present keyphrases compared to other methods, especially on long documents. The results we
obtained in generating the absent keyphrases are also encouraging. Our method includes a new
algorithm that reduces the problem of overlapping and duplicate keyphrases. In the future, we will
improve the performance of our method in generating absent keyphrases by providing a larger training
dataset, especially for short texts.